Triton v3.6.x iluvatar backend and 5 TLE primitives support #724
Salamanca001 wants to merge 3 commits into
sunnycase left a comment
Thanks for the work here. Before this is finalized, could you please add a concise summary of the TLE primitive implementation plan?
It would be helpful to cover the main design points, such as the abstraction/lowering flow, compiler/runtime integration points, supported operator scope, dtype/shape/backend limitations, and the validation approach. Could you also include performance data for a few representative operators, ideally with baseline vs. TLE primitive numbers, test shapes, hardware/backend configuration, and measurement methodology?
For the expected level of detail and presentation style, PR #617 could be a useful reference.
13abd15 to c39b53d
[TLE][ILUVATAR] Support TLE Structure on iluvatar backend
This patch adds TLE (Triton Language Extension) structure support to the iluvatar
1. Overview
The iluvatar backend reuses the shared TLE Python frontend under
2. Supported primitives
2.1
| Supported TLE tests |
|---|
| python/test/tle/integration/test_tle_local_store.py |
| python/test/tle/unit/test_tle_gpu_local_ptr.py |
| python/test/tle/unit/test_extract_tile_static_index.py |
| python/test/tle/unit/test_extract_tile_dynamic_index.py |
| python/test/tle/unit/test_insert_tile_static_index.py |
| python/test/tle/unit/test_insert_tile_dynamic_index.py |
5. Performance data
5.1 Measurement methodology
- Benchmark sources (backend-agnostic tutorials):
  - python/tutorials/tle/01-fft.py
5.2 Environment
| Field | Value |
|---|---|
| Hardware | Iluvatar Corex |
| Driver / SDK | 4.5.0 |
| Torch | 2.10.0 |
| FlagTree | triton_v3.6.x |
5.3 Representative results
| | N | Triton (ms) | TLE (ms) | Torch (ms) |
|---|---|---|---|---|
| 0 | 64.0 | 0.045962 | 0.116885 | 0.022308 |
| 1 | 128.0 | 0.064135 | 0.129558 | 0.036635 |
| 2 | 256.0 | 0.135442 | 0.187827 | 0.056135 |
| 3 | 512.0 | 0.427827 | 1.050106 | 0.122798 |
| 4 | 1024.0 | 1.268423 | 3.248211 | 0.221808 |
Speedup is computed as baseline_time / TLE_time, so a value below 1x means the TLE kernel is slower than the baseline:
| Comparison | Mean |
|---|---|
| TLE FFT vs Triton FFT | 0.48x |
| TLE FFT vs Torch FFT | 0.19x |
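
The mean speedups can be cross-checked against the per-size timings in 5.3. The following is a minimal sketch, assuming "Mean" is the arithmetic mean of the per-N ratios (the PR does not say which mean is used, although this choice matches the reported figures). The names `timings` and `mean_speedup` are illustrative and do not come from the benchmark script.

```python
# Timings in milliseconds, copied from the 5.3 results table and keyed by N.
timings = {
    64:   {"triton": 0.045962, "tle": 0.116885, "torch": 0.022308},
    128:  {"triton": 0.064135, "tle": 0.129558, "torch": 0.036635},
    256:  {"triton": 0.135442, "tle": 0.187827, "torch": 0.056135},
    512:  {"triton": 0.427827, "tle": 1.050106, "torch": 0.122798},
    1024: {"triton": 1.268423, "tle": 3.248211, "torch": 0.221808},
}


def mean_speedup(baseline: str) -> float:
    """Arithmetic mean over N of baseline_time / TLE_time (assumed definition)."""
    ratios = [row[baseline] / row["tle"] for row in timings.values()]
    return sum(ratios) / len(ratios)


print(f"TLE FFT vs Triton FFT: {mean_speedup('triton'):.2f}x")
print(f"TLE FFT vs Torch FFT:  {mean_speedup('torch'):.2f}x")
```

Averaging per-size ratios weights every N equally. A ratio of summed times would instead be dominated by the largest sizes, so the two definitions generally give different numbers.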
6. Status note
This patch delivers functional support for the five TLE primitives on the
iluvatar backend (correctness validated by unit/integration tests and CI). As the
benchmark results above show, the TLE FFT path is not yet competitive: on average
it reaches 0.48x the speed of the native Triton kernel and 0.19x that of Torch.
Performance optimization is planned for follow-up commits.
7ed5e70 to cc55e41
cc55e41 to c257aa5
This PR brings the Iluvatar backend support onto Triton 3.6 in FlagTree and adds Iluvatar TLE lowering support.
Included commits:
- 7b4cac885 [BACKEND] update iluvatar backend support on triton3.6.
- 13abd15d8 [TLE][ILUVATAR] Add TLE support for alloc, local_ptr, copy, extract_tile and insert_tile.

Main changes:
- third_party/iluvatar backend integration, including compiler/driver entry points, Iluvatar GPU dialect, lowering passes, target info, utility code, build wiring, and test runner.
- TLE support for alloc, local_ptr, copy, extract_tile, and insert_tile.
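
For readers unfamiliar with tile extraction and insertion, the general idea can be illustrated in plain NumPy. This is a conceptual sketch only and is not the TLE API: the helper names `extract_tile_ref` and `insert_tile_ref`, their argument order, and the convention that the index counts tiles rather than elements are all assumptions made for illustration. The actual primitive signatures and semantics are defined by the TLE frontend and the tests listed above.

```python
import numpy as np


def extract_tile_ref(src, tile_index, tile_shape):
    """Hypothetical reference: copy out the tile at tile_index.

    Assumes tile_index counts whole tiles, not elements.
    """
    rows, cols = tile_shape
    r0, c0 = tile_index[0] * rows, tile_index[1] * cols
    return src[r0:r0 + rows, c0:c0 + cols].copy()


def insert_tile_ref(dst, tile, tile_index):
    """Hypothetical reference: return a copy of dst with tile written at tile_index."""
    rows, cols = tile.shape
    r0, c0 = tile_index[0] * rows, tile_index[1] * cols
    out = dst.copy()
    out[r0:r0 + rows, c0:c0 + cols] = tile
    return out


x = np.arange(16).reshape(4, 4)
t = extract_tile_ref(x, (1, 0), (2, 2))            # rows 2..3, columns 0..1 of x
y = insert_tile_ref(np.zeros_like(x), t, (0, 1))   # written to rows 0..1, columns 2..3
print(t)
print(y)
```

A host-side reference of this kind is one common way to check a kernel's output elementwise, whether the tile index is a compile-time constant or computed at run time.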